There are many failure scenarios in any distributed system. Ditto Server relies on a durable transaction log to handle many of them, and on replicated copies of data to handle many others.
`p1r2` has become unresponsive. Remove it from the Current Config, create a Next Config with a new server to take the place of `p1r2`, store the configs in the Strongly Consistent metadata store, and signal the nodes. The new node will begin to consume transactions and backfill, and the UST will rise.
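The replacement steps above can be sketched roughly as follows. This is a minimal illustration, not Ditto Server's actual implementation: `Config`, `MetadataStore`, `replace_replica`, and the `signal` callback are hypothetical names, the metadata store is a plain in-memory object, and the replacement server name `p1r4` in the usage note is invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass(frozen=True)
class Config:
    """Hypothetical config: partition name -> tuple of replica names."""
    partitions: Dict[str, Tuple[str, ...]]


@dataclass
class MetadataStore:
    """In-memory stand-in for the strongly consistent metadata store."""
    current: Config
    next: Optional[Config] = None


def replace_replica(store: MetadataStore, failed: str, replacement: str,
                    signal: Callable[[str], None]) -> Tuple[Config, Config]:
    """Remove `failed` from the Current Config and swap in `replacement` in a Next Config."""
    old = store.current.partitions

    # Current Config: the unresponsive server is dropped.
    current = Config({p: tuple(r for r in reps if r != failed)
                      for p, reps in old.items()})

    # Next Config: the new server takes the failed server's place.
    nxt = Config({p: tuple(replacement if r == failed else r for r in reps)
                  for p, reps in old.items()})

    # Store both configs, then signal every node named in the Next Config.
    store.current, store.next = current, nxt
    for node in sorted({r for reps in nxt.partitions.values() for r in reps}):
        signal(node)
    return current, nxt
```

Calling `replace_replica(store, "p1r2", "p1r4", notify)` on a store whose partition `p1` holds `p1r1`, `p1r2`, and `p1r3` leaves a Current Config with `p1r1` and `p1r3`, and a Next Config in which `p1r4` occupies the slot `p1r2` used to hold. Consuming transactions and backfilling on the new node are outside the scope of this sketch.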
The replicas see from their `IntervalMaps` that some transaction T has been missed. After a strongly consistent read of the metadata store confirms that no server in the Next Config may have the data, the replicas agree unilaterally to pretend that they did store this data, and they splice it into their `IntervalMaps`. The UST rises, and progress is made.
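To make the splice concrete, here is a small sketch under stated assumptions. It models each replica's interval map as a sorted list of closed `(start, end)` ranges of transaction ids, and it assumes the UST is bounded by the lowest gap-free prefix across replicas. Both the representation and the function names (`insert`, `contiguous_upto`, `splice_missing`) are hypothetical and are not taken from Ditto Server.

```python
from typing import Dict, List, Optional, Tuple

Interval = Tuple[int, int]


def insert(intervals: List[Interval], start: int, end: int) -> List[Interval]:
    """Merge the closed interval [start, end] into a list of disjoint intervals."""
    merged: List[Interval] = []
    for s, e in sorted(intervals + [(start, end)]):
        if merged and s <= merged[-1][1] + 1:  # overlapping or adjacent
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged


def contiguous_upto(intervals: List[Interval]) -> Optional[int]:
    """End of the first interval: the highest id covered before the first gap."""
    return intervals[0][1] if intervals else None


def splice_missing(replica_maps: Dict[str, List[Interval]], txn: int,
                   next_config_may_have_data: bool) -> Dict[str, List[Interval]]:
    """Mark `txn` as stored on every replica, but only when nobody else may hold it."""
    # Assumption: the caller has already done the strongly consistent read of the
    # metadata store and passes its result in as a boolean.
    if next_config_may_have_data:
        raise RuntimeError("a server in the Next Config may have the data; do not splice")
    return {name: insert(ivs, txn, txn) for name, ivs in replica_maps.items()}
```

With `{"p1r1": [(0, 4), (6, 9)], "p1r3": [(0, 4), (6, 7)]}` and missing transaction `5`, both replicas are stuck at `4` before the splice. Afterwards their maps collapse to `[(0, 9)]` and `[(0, 7)]`, so the lowest gap-free prefix across the two rises to `7`.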
It is essential to understand that this is a disaster scenario, not business as usual. Disasters do happen, though, and they should be planned for. We do everything we can to never lose data, including a replicated durable transaction log with a long retention policy.